A few weeks ago, I got around to participating in TidyTuesday for the first time. The data set for that week (2019-12-03) contained data related to parking violations in the city of Philadelphia during 2017.
The plots I made and the code for each are below.
Plot 1: Histogram of ticket issue times
Plot 2: Totals per day details
Plot 3: Tickets per day by zip code
I just realized that I did not share the code for any of the plots that I made so I will write it up here.
First, I:
read in the data using read_csv {readr}
use head {utils} to look at the first few rows of the data
library(readr)
df <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-12-03/tickets.csv")
head(df)
## # A tibble: 6 x 7
## violation_desc issue_datetime fine issuing_agency lat lon zip_code
## <chr> <dttm> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 BUS ONLY ZONE 2017-12-06 12:29:00 51 PPA 40.0 -75.1 19149
## 2 STOPPING PROHIB… 2017-10-16 18:03:00 51 PPA 40.0 -75.2 19127
## 3 OVER TIME LIMIT 2017-11-02 22:09:00 26 PPA 40.0 -75.2 19127
## 4 OVER TIME LIMIT 2017-11-05 20:19:00 26 PPA 40.0 -75.2 19127
## 5 STOP PROHIBITED… 2017-10-17 06:58:00 76 PPA 40.0 -75.2 19102
## 6 DOUBLE PARKED 2017-10-02 10:40:00 51 POLICE 40.0 -75.1 NA
Next, I make the following changes to the data:
separate the issue_datetime column into Date and Time columns using separate {tidyr}
use mutate {dplyr} to:
library(dplyr)
library(tidyr)
library(lubridate)
df <- separate(df, issue_datetime, into = c("Date", "Time"), sep = " ") %>%
arrange(Date, Time) %>% ungroup %>% mutate(Date = ymd(Date),
day = weekdays(Date))
The first 6 rows of df now contain the following information:
| violation_desc | Date | Time | fine | issuing_agency | lat | lon | zip_code | day |
|---|---|---|---|---|---|---|---|---|
| HP RESERVED SPACE | 2017-01-01 | 00:00:00 | 301 | POLICE | 39.99941 | -75.14233 | 19133 | Sunday |
| CORNER CLEARANCE | 2017-01-01 | 00:00:00 | 51 | POLICE | 39.95505 | -75.23199 | 19143 | Sunday |
| FIRE HYDRANT | 2017-01-01 | 00:00:00 | 76 | POLICE | 39.97776 | -75.24400 | 19151 | Sunday |
| BLOCKING DRIVEWAY | 2017-01-01 | 00:28:00 | 51 | POLICE | 39.97362 | -75.16394 | 19121 | Sunday |
| FIRE HYDRANT | 2017-01-01 | 00:41:00 | 76 | POLICE | 40.03015 | -75.08998 | 19124 | Sunday |
| DOUBLE PARKED CC | 2017-01-01 | 00:50:00 | 76 | POLICE | 39.94893 | -75.16093 | NA | Sunday |
For this plot, I create new variables to:
I created a data frame with the offset for each day:
day <- c("Monday", "Tuesday", "Wednesday", "Thursday",
"Friday", "Saturday", "Sunday")
offset <- (0:6)*24
daylookup <- data.frame(day, offset)
daylookup
## day offset
## 1 Monday 0
## 2 Tuesday 24
## 3 Wednesday 48
## 4 Thursday 72
## 5 Friday 96
## 6 Saturday 120
## 7 Sunday 144
To assign each ticket to one of the 168 hours of the week, I carried out the following steps:
df_hms <- separate(df, Time, into = c("hr", "minute", "second")) %>%
mutate(hr = as.numeric(hr)) %>% left_join(daylookup) %>%
mutate(incident_hr = offset+hr)
Now, I can make the plot using ggplot {ggplot2} using a histogram geometry.
library(ggplot2)
library(scales)
ggplot(data = df_hms, aes(x = incident_hr)) +
geom_histogram(fill = "royalblue", colour = "black", bins = 168) +
theme(plot.title = element_text(hjust = 0.5)) +
scale_y_continuous(label = comma) + xlab("") + ylab("") + theme_bw() +
ggtitle("Philadelphia Parking Tickets by Time of Issue (2017)") +
theme(plot.title = element_text(hjust = 0.5))
Then I annotated the plot by adding:
ggplot(data = df_hms, aes(x = incident_hr)) +
geom_histogram(fill = "royalblue", colour = "black", bins = 168) +
theme(plot.title = element_text(hjust = 0.5)) +
scale_y_continuous(label = comma) + xlab("") + ylab("") + theme_bw() +
ggtitle("Philadelphia Parking Tickets by Time of Issue (2017)") +
theme(plot.title = element_text(hjust = 0.5)) +
geom_vline(xintercept = 23.5, linetype = "dashed") +
geom_vline(xintercept = 47.5, linetype = "dashed") +
geom_vline(xintercept = 71.5, linetype = "dashed") +
geom_vline(xintercept = 95.5, linetype = "dashed") +
geom_vline(xintercept = 119.5, linetype = "dashed") +
geom_vline(xintercept = 143.5, linetype = "dashed") +
annotate("text", x = 12, y = 27000, label = "Mon") +
annotate("text", x = 36, y = 27000, label = "Tue") +
annotate("text", x = 60, y = 27000, label = "Wed") +
annotate("text", x = 84, y = 27000, label = "Thu") +
annotate("text", x = 110, y = 27000, label = "Fri") +
annotate("text", x = 134, y = 27000, label = "Sat") +
annotate("text", x = 158, y = 27000, label = "Sun")
As background work to create this plot, I:
df$newday <- ifelse(df$Date == lag(df$Date), 0, 1)
df$newday <- ifelse(is.na(df$newday), 1, df$newday)
df$daynum <- cumsum(df$newday)
df$week <- ((df$daynum - 1) %/% 7) + 1
Next I did a few more data manipulation operations:
aggregating total tickets per day over the week and day variables
make day a factor with levels corresponding to the days of the week (Monday to Sunday)
agg_week <- group_by(df, week, day) %>% summarise(Tickets = n())
aw <- agg_week
aw$day <- factor(aw$day, levels = c("Monday", "Tuesday", "Wednesday", "Thursday",
"Friday", "Saturday", "Sunday"))
Now to make a ggplot {ggplot2} with * with a line geometry * faceted on the week variable
library(ggplot2)
library(scales)
wkday_plt <- ggplot(data = aw, aes(x = week, y = Tickets)) + geom_line() +
facet_wrap(~day, scales="free_x") + theme_minimal() +
ggtitle("Philadelphia Parking Violations (2017)") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_y_continuous(label = comma) + xlab("") + ylab("")
wkday_plt
After making the chart, I realized that I could have made essentially the same chart simply by using Date as the x variable input. The only difference would be that the x-axis values would be dates instead of the week numbers. (I remove the dates on the x-axis anyway below.)
df_agg_day <- group_by(df, Date, day) %>% summarise(Tickets = n()) %>% ungroup()
df_agg_day$day <- factor(df_agg_day$day, levels = c("Monday", "Tuesday", "Wednesday", "Thursday",
"Friday", "Saturday", "Sunday"))
wkday_plt2 <- ggplot(data = df_agg_day, aes(x = Date, y = Tickets)) + geom_line() +
facet_wrap(~day) + theme_minimal() +
ggtitle("Philadelphia Parking Violations (2017)") +
theme(plot.title = element_text(hjust = 0.5),
axis.text.x = element_blank()) +
scale_y_continuous(label = comma) + xlab("") + ylab("")
wkday_plt2
In the time since I originally made this plot, I have been using the plotly library. I think this is a good opportunity to demonstrate how easily it allows you to make an interactive chart for the web.
It’s really very simple:
create a ggplot
use the plot created as input to ggplotly
library(plotly)
ggplotly(wkday_plt2)
You can now hover over the chart and see the total tickets issued on a specific day and see the exact date. It’s a lot of functionality for very little effort.
The key steps for making this plot are:
finding the average number of tickets per day in each zip code
putting the zip codes into groups
merging the numerical data to geospatial data corresponding to the shapes of the different zip codes represented in the data
use group_by {dplyr} and summarise {dplyr} to find the average number of tickets per day. (And don’t forget to ungroup with ungroup {dply} when you are done.)
add a new variable, Tickets per day to bucket the zip codes into 5 ranges
df_zip <- group_by(df, zip_code) %>%
summarise(Tickets = n()/365) %>%
ungroup() %>%
mutate(`Tickets per day` = ifelse(Tickets < 10, "< 10",
ifelse(Tickets < 100, "10 to 99",
ifelse(Tickets < 200, "100 to 199",
ifelse(Tickets < 300, "200 to 299", "300+")))))
I really should learn how to avoid so many nested ifelse statements. I’m sure R – or a widely-used package – has a case statement.
I was able to get a shapefile for the zip codes of Philadelphia from OpenDataPhilly.
To read it in, I used the readOGR {rgdal} function which reads the shapefile into a Spatial vector object.
library(rgdal) # will load sp package as well
philly <- readOGR("Zipcodes_Poly/Zipcodes_Poly.shp")
Looking at the created object, philly in the Viewer window we can see that the CODE field (contained in data) contains the 48 zip codes corresponding to the 48 polygons in polygons.
Next we can merge the data frame, df_zip (containing the average tickets per day for each zip code) with the Spatial object, philly (containing the specification of the polygons for the zip codes).
We can do this using merge {sp}. I load the sf package as I will be using another function from it immediately after.
library(sf)
philly <- merge(philly, df_zip, by.x = "CODE",
by.y = "zip_code")
Now we convert philly to an sf (Simple Features) object which will work very easily with ggplot {ggplot2}.
phi_sf <- st_as_sf(philly)
Actually making the plot with ggplot is very simple: just use the geom_sf {ggplot2} to use a simple features geometry and indicate that Tickets per day is the variable we want to use as the fill color for each zip code.
library(ggplot2)
ggplot() + geom_sf(data = phi_sf, aes(fill = `Tickets per day` )) + theme_minimal() +
ggtitle("Philadelphia Parking Violations (2017)") +
theme(plot.title = element_text(hjust = 0.5))
Strictly speaking, these per day averages may be incorrect as approximately 14% of the tickets in the raw data have no corresponding zip code. So these averages are really lower bounds for 2017.
So that’s it. The plots I made for TidyTuesday on 2019-12-03. Until next time.